A Term Association Translation Model for Naive Bayes Text Classification

نویسندگان

  • Meng-Sung Wu
  • Hsin-Min Wang
چکیده

Text classi cation (TC) has long been an important research topic in information retrieval (IR) related areas. In the literature, the bag-of-words (BoW) model has been widely used to represent a document in text classi cation and many other applications. However, BoW, which ignores the relationships between terms, o ers a rather poor document representation. Some previous research has shown that incorporating language models into the naive Bayes classi er (NBC) can improve the performance of text classi cation. Although the widely used N -gram language models (LM) can exploit the relationships between words to some extent, they cannot model the long-distance dependencies of words. In this paper, we study the term association modeling approach within the translation LM framework for TC. The new model is called the term association translation model (TATM). The innovation is to incorporate term associations into the document model. We employ the term translation model to model such associative terms in the documents. The term association translation model can be learned based on either the joint probability (JP) of the associative terms through the Bayes rule or the mutual information (MI) of the associative terms. The results of TC experiments evaluated on the Reuters-21578 and 20newsgroups corpora demonstrate that the new model implemented in both ways outperforms the standard NBC method and the NBC with a unigram LM.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...

متن کامل

Poisson naive Bayes for text classification with feature weighting

In this paper, we investigate the use of multivariate Poisson model and feature weighting to learn naive Bayes text classifier. Our new naive Bayes text classification model assumes that a document is generated by a multivariate Poisson model while the previous works consider a document as a vector of binary term features based on the presence or absence of each term. We also explore the use of...

متن کامل

A Characterization of Wordnet Features in Boolean Models For Text Classification

Supervised text classification is the task of automatically assigning a category label to a previously unlabeled text document. We start with a collection of pre-labeled examples whose assigned categories are used to build a predictive model for each category. In previous research, incorporating semantic features from the WordNet lexical database is one of many approaches that have been tried t...

متن کامل

A Validation Test Naive Bayesian Classification Algorithm and Probit Regression as Prediction Models for Managerial Overconfidence in Iran's Capital Market

Corporate directors are influenced by overconfidence, which is one of the personality traits of individuals; it may take irrational decisions that will have a significant impact on the company's performance in the long run. The purpose of this paper is to validate and compare the Naive Bayesian Classification algorithm and probit regression in the prediction of Management's overconfident at pre...

متن کامل

Text Classification using Association Rule with a Hybrid Concept of Naive Bayes Classifier and Genetic Algorithm

Text classification is the automated assignment of natural language texts to predefined categories based on their content. Text classification is the primary requirement of text retrieval systems, which retrieve texts in response to a user query, and text understanding systems, which transform text in some way such as producing summaries, answering questions or extracting data. Now a day the de...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012